The identification of three-dimensional pharmacophores from large, heterogeneous data sets is achieved by combining a conformational 

MEANS AND METHOD FOR RECURSIVE PARTITIONING ANALYSIS OF 
LARGE STRUCTURE-ACTIVITY DATA SETS 

RELATED APPLICATION 

The subject matter of this application is related to the subject matter of the 
commonly owned application PCT Serial Number WO98/47087, entitled "Statistical 
Deconvoluting of Mixtures" filed on April 17, 1998, the contents of which are incorporated 
by reference as if fully disclosed herein. 

BACKGROUND 

The recent progress of combinatorial chemistry and high throughput screening 
techniques has brought a revolution in the drug discovery processes of the pharmaceutical 
industry. It is now feasible to obtain biological activity data for thousands to hundreds of 
thousands of chemical compounds in a short period of time, leading to a tremendous 
increase in the quantity of data for the drug discovery cycle. However, analyzing and 
utilizing these data sources, and/or converting them into useful formats in a timely fashion, is 
still a challenge for chemoinformatics. Among the many relevant problems, one is 
automated pharmacophore identification for large, heterogeneous chemical data sets. 

A pharmacophore, usually defined as the key chemical features and the spatial 
relationships between them (configurations) associated with the biological activities of 
chemical compounds, is one of the most important concepts in medicinal chemistry and 
plays a critical role in the drug discovery process. Pharmacophore models can help 
medicinal chemists gain insight into the key ligand-receptor interactions that are 
responsible for the biological activities, even when the receptor structure has not been 
determined. The models can be used as the search queries for pharmacophore searches, 
thereby assisting in the discovery of lead compounds. The models can also be used as the 
initial step of 3D QSAR analysis by grouping the compounds that follow the same binding 
mode and indicating the possible 3D alignment rules. 

However, pharmacophore identification remains a process heavily dependent on 
medicinal chemists' experience and intuition. In most of the reported work on 
pharmacophore search, the search queries were taken from the literature or from the crystal 
structures of receptor-ligand complexes. Over the past decade, several efforts, called 
automated pharmacophore identification/mapping/recognition, have tried to bring 
more contributions from computational science into the pharmacophore identification 
process. 

Some of the algorithms and programs that have been specifically designed for this 
purpose include: the active analogue approach, ensemble distance geometry, DISCO, 
Catalyst/Hypo, HipHop, Apex-3D, DANTE, and the ILP (Inductive Logic Programming) 
system Progol. However, each of these programs or algorithms suffers from one or more of the 
following limitations: a) they are inherently limited to small data sets, which typically 
contain fewer than 50 compounds, since none of the programs were originally designed for 
large, heterogeneous chemical data sets; b) the programs only utilize the structural 
information provided by a small number of active compounds; c) most of these algorithms 
cannot handle the situation of multiple binding modes, which are expected in large 
chemical data sets. Preferably, any software designed to remedy these shortcomings can 
also complete this process in a reasonable amount of time, enabling quicker identification of 
pharmacophore models. 

What is needed is a system and method for identifying three-dimensional 
pharmacophores from large, heterogeneous data sets that utilizes structure and activity 
information from a large array of compounds and completes the process in a reasonable 
amount of time. 

SUMMARY OF INVENTION 

The present invention, whose software embodiment is referred to as SCAMPI (Statistical 
Classification of Activities of Molecules for Pharmacophore Identification), identifies 
three-dimensional pharmacophores from large, heterogeneous data sets by combining fast 
conformation generation with recursive partitioning. It is an extension of the SCAM 
(Statistical Classification of Activities of Molecules) software system. The pharmacophore 
identification process runs recursively, and the conformational spaces are re-sampled under the 
constraints of the evolving pharmacophore model. The present invention derives 
pharmacophore models from data sets of up to 2000 compounds, with thousands of 
conformations generated for each compound. With the improvement in efficiency provided 
by the present invention, this process can be completed in less than one day of 
computational time. Thus, the present invention enables fast computation of 
pharmacophores from large, structurally heterogeneous data sets. The identified 
pharmacophores can then be used for drug design, as input to computational chemistry 
methods like 3D QSAR, and for the mathematical/in silico screening of large 3D databases of 
real or virtual compounds. 



BRIEF DESCRIPTION OF THE DRAWINGS 



Figure 1 provides an example of a bad contact and the release procedure; 
Figure 2 is a flowchart illustrating the conformation search method of the present invention; 
Figure 3 illustrates the conformation generation procedure for a pseudo-compound using the 
present invention; 
Figure 4 is a flowchart of the general design of the software implementation of the present 
invention, referred to as SCAMPI, wherein the conformational and correspondence searches 
are combined; 
Figure 5 illustrates a tree generated by SCAMPI for the MAO data set; 
Figure 6 illustrates the pharmacophores for the MAO data set; 
Figure 7 illustrates a tree generated by SCAMPI for the ACE data set; and 
Figure 8 illustrates the pharmacophores for the ACE data set. 

Further illustrations are also included that provide an additional overview of the present 
invention. 



DESCRIPTION 

Before going into the details of the software embodiment of the present invention, 
information concerning the search queries enabled by the present system and method is 
helpful. Generalized chemical features that are important in selective binding are usually 
used in the pharmacophore search queries. In the preferred embodiment, the definition 
system for such chemical features includes the following features by default: negative charge 
centers, positive charge centers, hydrogen bond acceptors, hydrogen bond donors, aromatic 
ring centers, hydrophobic centers, triple bond centers, plus some explicit atom types for 
heteroatoms: N, O, S, P, F, and halogen (except F). To attain both generality and efficiency, 
these chemical features are determined in two ways: substructure searches using a fragment 
library as queries for the chemical features defined by multiple atoms, like guanidine, etc., 
and general rule-based searches for the chemical features defined by a single atom and its 
closest neighbors. Additional definitions can be added to the fragment library to include 
new chemical features. 

The pharmacophore search and identification process traditionally involves 
searching two spaces: the conformational space that represents all the reasonable 3D 
structures for each individual compound and the correspondence space that indicates all the 
common chemical feature matchings for a class of compounds. 

(1) Conformational search. 

As herein described, all the structure data for the conformational search are 
automatically generated by a topology analysis. During the topology analysis, the 
topological structure of each compound is decomposed into small units, which are the 
maximal subsets of atoms whose inter-atomic distances are invariant with respect to the 
torsional rotation of rotatable bonds. There are two types of units defined in SCAMPI: rigid 
groups and flexible rings. Rigid groups have no additional internal conformational degrees 
of freedom, for example the aromatic ring, methylene group, and methyl group, while flexible 
rings can change their own conformations by ring flipping, such as the cyclohexane ring. 
The flippable corners on each flexible ring are also identified as the non-fusion ring atoms. 
Once identified, these units are assigned sequential numbers that determine the order in 
which they will be assembled in the conformation build-up process. 
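
The decomposition into units can be illustrated with a minimal sketch. It assumes a molecule is given as a bond list with the rotatable bonds already flagged, and it simply treats each unit as a connected component that survives when the rotatable bonds are cut. This is a simplification for illustration only: the function name `rigid_units` and the input layout are hypothetical, flexible rings are not distinguished from rigid groups, and the actual SCAMPI topology analysis is not disclosed at this level of detail.

```python
from collections import defaultdict

def rigid_units(n_atoms, bonds, rotatable):
    """Group atoms into units that stay connected when rotatable bonds are cut.

    bonds: list of (i, j) atom index pairs.
    rotatable: set of indices into `bonds` marking rotatable bonds.
    Returns a list of units; the list position serves as the sequential unit number.
    """
    adjacency = defaultdict(list)
    for idx, (i, j) in enumerate(bonds):
        if idx not in rotatable:          # only non-rotatable bonds hold a unit together
            adjacency[i].append(j)
            adjacency[j].append(i)

    unit_of = [-1] * n_atoms
    units = []
    for start in range(n_atoms):
        if unit_of[start] != -1:
            continue
        unit_id = len(units)
        stack, members = [start], []
        unit_of[start] = unit_id
        while stack:                       # depth-first walk over one component
            atom = stack.pop()
            members.append(atom)
            for nb in adjacency[atom]:
                if unit_of[nb] == -1:
                    unit_of[nb] = unit_id
                    stack.append(nb)
        units.append(sorted(members))
    return units

# Hypothetical heavy-atom skeleton C0-C1-C2-C3 with a rotatable central bond.
butane_units = rigid_units(4, [(0, 1), (1, 2), (2, 3)], rotatable={1})
```

Numbering the units in discovery order mirrors the sequential numbering described above, although a real build-up order would presumably follow the bond connectivity from a chosen root unit.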

The conformational search implemented by the present system and method provides 
a number of advantages over the prior art methods. Traditionally, conformational search 
methods have tried to use a small set (usually tens) of conformers to completely cover the 
whole conformational space of each compound. These methods are hopeful at best. The 
representative conformers were usually generated by some conformation search method 
followed by clustering analysis. Alternatively, some computational methods were designed 
to automatically generate diverse conformers, such as the "poling" method. Force field 
calculation and energy minimization are often used to generate low-energy conformations, 
typically within several kcal/mol of the global minimum. A Tripos-like force field was set 
up for the molecular mechanics calculations in SCAMPI. It includes terms for bond, angle, 
torsion, plane and van der Waals interactions; the electrostatic interaction term is 
excluded to simplify and speed up the calculations. All the data structures necessary for force 
field calculations are also automatically generated by topology analysis. The drawbacks of 
the traditional strategy are obvious. Tens of conformers may not be enough to adequately sample the 
conformational space of a highly flexible compound, receptor-bound conformations may not 
lie in the low-energy regions, and heavy computational burdens have traditionally 
prohibited the use of large data sets. The present invention, however, provides a number of 
means for improving the operation and efficiency of the sampling process, enabling the use 
of larger, and therefore more accurate, data sets. Each improvement will be described 
separately but, in conjunction with one another, they serve to provide an entirely new range 
of acceptable sizes of the data sets that can be used. 

Referring now to Figures 2 and 3, the flow chart of the general conformational search 
procedure in SCAMPI is illustrated. 

Step 1: If there are any flexible rings in the compound, each flexible ring is isolated by 
cutting off its side chain(s) but keeping its nearest side-chain neighbors. 
In prior art methods, the conformational search of a flexible ring was transformed into the 
conformational search of a chain by breaking one of the ring bonds and then adding a tight 
distance constraint between the two terminal atoms of that broken bond. Obviously, this 
kind of distance constraint needs to be very tight, and it can be expected that 
considerable computational effort is required for successful ring closure. However, 
conformational search can be implemented in two different coordinate systems: the Cartesian 
coordinate system, where the Cartesian coordinates of each atom are directly perturbed, and 
the internal coordinate system, where the torsion angle of each rotatable bond is directly 
modified. Cartesian coordinates are more suitable for the conformational search of flexible 
rings, while internal coordinates are a natural choice for the non-ring parts of the structure. 
For this reason, the conformational search in SCAMPI is implemented in both Cartesian and 
internal coordinate systems, for searching conformations of flexible rings and chains, 
respectively. 

Step 2: Each flexible corner of the isolated ring is randomly perturbed relative to the 
average ring plane. Several steps (20-100 steps) of energy minimization are then used to 
optimize the ring structure to a reasonable geometry. 

Step 3: All the side chains are re-connected to the ring to form a complete compound. 
Another several steps (20-100 steps) of energy minimization are used to optimize the whole 
compound's structure to release any bad van der Waals contacts between side chains 
and rings. 

Step 4: Now, treating the flexible rings as rigid, the "differential distance equation" 
algorithm is sequentially used to find acceptable torsional ranges for each rotatable bond. 
After the acceptable torsional ranges of a rotatable bond are determined by the "differential 
distance equation" algorithm, the sampling points within such acceptable torsional ranges 
are picked randomly and uniformly. The random search is performed in order to obtain a 
uniform sampling of the whole conformational space. More easily accessible conformations 
are more likely to be shared by the active compounds at the binding site, and therefore 
should give a stronger signal and have a greater chance of being selected by the statistical test. 
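
As an illustration of the random, uniform picking of sampling points, the following minimal sketch draws a torsion angle uniformly from a union of acceptable ranges. It assumes the ranges are disjoint closed intervals in radians; the function name `sample_torsion` and this interval representation are hypothetical and are not taken from the SCAMPI implementation.

```python
import random

def sample_torsion(ranges, rng=random):
    """Draw one angle uniformly from a union of disjoint [lo, hi] intervals (radians)."""
    widths = [hi - lo for lo, hi in ranges]
    total = sum(widths)
    if total <= 0:
        raise ValueError("no acceptable torsional range")
    # Pick a position along the concatenated intervals, then map it back.
    pick = rng.uniform(0.0, total)
    for (lo, hi), width in zip(ranges, widths):
        if pick <= width:
            return lo + pick
        pick -= width
    return ranges[-1][1]  # guard against floating-point round-off
```

Weighting each interval by its width is what makes the draw uniform over the whole acceptable region rather than uniform over the intervals themselves.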
Description: 

The conformational search method applied in SCAMPI utilizes the "differential distance 
equation" to generate conformational structures that do not contain any bad van der Waals 
bumps between atoms. The "differential distance equation" algorithm is a look-ahead algorithm. 
Before it actually rotates a bond, it can determine the acceptable torsional ranges of that 
bond, i.e., the ranges that lead to a partial conformational structure satisfying various 
distance constraints, such as the van der Waals distance constraints (and, as will become 
apparent, importantly the pharmacophore distance constraints). It is therefore much more efficient 
than the usual trial-and-error algorithms, where most of the computational effort is spent 
finding the acceptable torsional angles. 

Previously, the acceptable torsional ranges of a rotatable bond were determined by the 
intersection of the acceptable torsional ranges of all the atom pairs that contain atoms on 
both sides of that bond. Thus, if there are M atoms on one side of a rotatable bond and N 
atoms on the other side, the calculation must be repeated about MxN times before 
the final acceptable torsional ranges are obtained. In most cases, however, the final acceptable 
torsional ranges are determined by only a small number of atom pairs (usually far fewer 
than MxN). The acceptable torsional ranges of many atom pairs are [-π, π], and 
therefore they need not be considered further in the intersection step. This results from the 
fact that the minimum accessible distances between the two atoms of such atom pairs are 
larger than their van der Waals distance constraints, so that it is not possible to have van der 
Waals contact between these atom pairs no matter how the bond rotates. 

Consequently, the present invention uses the minimum-accessible-distance calculation to 
examine all the atom pairs. This calculation is much faster than the calculation of the 
acceptable torsional ranges by the "differential distance equation". Only those atom pairs 
whose minimum accessible distances are less than their van der Waals distance constraints 
will be further considered by the "differential distance equation" algorithm. After the 
acceptable torsional ranges of a rotatable bond are determined by the "differential distance 
equation" algorithm, the sampling points within such acceptable torsional ranges are 
picked randomly and uniformly. The inclusion of such a minimum-accessible-distance 
calculation as a filter significantly increases the computational speed. 
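
The filter can be sketched as follows. The geometry used here is an assumption made for illustration, not a formula quoted from the invention: if one atom is held fixed and the other rotates about the bond axis, the squared distance is d²(φ) = Δh² + r_a² + r_b² − 2·r_a·r_b·cos(φ − φ₀), where h is the position along the axis and r the distance from it, so the minimum accessible distance is sqrt(Δh² + (r_a − r_b)²). The function names and the single shared `vdw_limit` are hypothetical; a real implementation would use a per-pair van der Waals constraint.

```python
import math

def axial_radial(point, origin, axis):
    """Split `point` into height along the unit vector `axis` (through `origin`)
    and perpendicular distance from that axis."""
    rel = [p - o for p, o in zip(point, origin)]
    height = sum(r * a for r, a in zip(rel, axis))
    perp = [r - height * a for r, a in zip(rel, axis)]
    return height, math.sqrt(sum(c * c for c in perp))

def min_accessible_distance(fixed_atom, moving_atom, origin, axis):
    """Smallest distance reachable as `moving_atom` rotates about the bond axis."""
    h_a, r_a = axial_radial(fixed_atom, origin, axis)
    h_b, r_b = axial_radial(moving_atom, origin, axis)
    return math.hypot(h_a - h_b, r_a - r_b)

def pairs_needing_full_check(fixed, moving, origin, axis, vdw_limit):
    """Keep only the atom pairs that could clash for some torsion angle.

    Pairs filtered out here have an acceptable range of [-pi, pi] and can be
    skipped in the (more expensive) torsional-range intersection step.
    """
    return [(i, j)
            for i, a in enumerate(fixed)
            for j, b in enumerate(moving)
            if min_accessible_distance(a, b, origin, axis) < vdw_limit]

# Hypothetical coordinates: bond axis along z through the origin.
pairs = pairs_needing_full_check(
    fixed=[(1.0, 0.0, -1.0)],
    moving=[(1.0, 0.0, 1.0), (1.0, 0.0, 5.0)],
    origin=(0.0, 0.0, 0.0), axis=(0.0, 0.0, 1.0), vdw_limit=3.0)
```

The cost per pair is a handful of dot products, which is why applying it to all MxN pairs is cheap compared with solving for the torsional ranges of each pair.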

Step 4 (continued): If the acceptable torsional ranges are empty, the "release 
procedure" will be tried to relieve the existing bad contacts. If this release attempt fails, the 
whole build-up process will be started again from the first rotatable bond. To save 
computational cost, the ring conformational search is repeated only periodically, 
after the chain conformational search has been successfully completed several times (default 
5). 

Description: One problem with the "differential distance equation" algorithm is that 
it can only look one step ahead, which means that the conformation build-up process may 
become stuck with a bad conformation. As illustrated in Figure 1, the process is obviously 
stuck at the 5th rotatable bond. No matter how the 5th bond is rotated, there is always a bad 
contact between the two terminal units. Prior art methods will return to the next 
sampling point of the 4th rotatable bond, to see if this bad contact can be released. If all the 
sampling points of the 4th rotatable bond have been tried and the bad contact still exists, the 
algorithm will go back to the 3rd rotatable bond. This backtracking strategy continues until 
the bad contact is released. At times, however, the bad contacts, as shown in Figure 1, 
could be released by rotating some rotatable bond between these two conflicting units. This 
release process can also be realized by using the "differential distance equation" algorithm 
with re-grouping of the units, as illustrated for the 3rd bond in Figure 1. When this release 
strategy fails to resolve bad contact problems, the present invention will restart the 
conformational build-up procedure from the 1st rotatable bond. In most cases, this release 
strategy works very well and no further rebuild-up procedure is needed. 

(2) Correspondence search. 

The earliest pharmacophore identification strategies, like the active analogue approach 
and ensemble distance geometry, avoided the correspondence search by requiring the user to 
identify the correspondence relationships of the pharmacophore features among different 
active compounds. More recently, most of the strategies and programs depended on some 
pair-wise comparison algorithm to determine the common chemical features and 
configurations in all of the active compounds. (Start with one pair of compounds and 
identify corresponding features.) This kind of strategy is computationally intensive and 
inherently limits itself to small data sets, since with n compounds at least n-1 pair-wise 
comparisons are needed (assuming that the most active compound is used as one 
compound in each pair). 

The correspondence search method applied in SCAMPI is based on our previous 
recursive partitioning work using the FIRM and SCAM programs (see included reference for 
further details). The method utilizes the information on the biological activities of all the 
compounds in the training set, and identifies the pharmacophore(s) by detecting the 
structural features that are most statistically significantly correlated with the biological 
activities. An easily interpreted dendrogram or tree diagram is also generated, in which the 
statistically best structural descriptors are used to split the large data set into smaller and 
more homogeneous subsets. The advantages of the recursive partitioning strategy include: a) It is 
inherently fast when compared with many other methods for grouping 
compounds; b) It overcomes the difficulties of handling nonlinear relationships and strong 
interactions in large SAR data sets; and c) It can detect multiple mechanisms by 
separating the chemical compounds with different mechanisms into the different arms and 
terminal nodes of the dendrogram. 

The Student's t-test is used to recursively partition the whole data set into smaller 
and more homogeneous subsets, until each subset cannot be split any further. If the 
compounds are scored active/inactive rather than by a continuous potency, then a chi-square 
test can be used rather than a t-test. For other methods of recursive partitioning, such as 
CART and C4.5, using the t-test (or chi-square test) is the preferred method to split a node. 
A string of "binary" descriptors is generated at first to describe each compound, 
indicating the presence or absence of a series of structural descriptors in the compound. Then, 
each of the structural descriptors is checked sequentially and the data set is split into 



WO 00/28429 



PCT/US99/25922 




two subsets according to whether or not that descriptor is in the structure. The Student's t- 
test is computed according to the following formula: 

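$$ t = \frac{\bar{X} - \bar{Y}}{\sqrt{\left(\dfrac{1}{M} + \dfrac{1}{N}\right)\dfrac{ssx + ssy}{M + N - 2}}} $$

where

$$ ssx = \sum_{i=1}^{M} \left(X_i - \bar{X}\right)^2, \qquad ssy = \sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^2 $$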

$X_1, X_2, \ldots, X_M$ are the activities in the first subset, and $Y_1, Y_2, \ldots, Y_N$ are the activities in 
the second subset. M and N are respectively the numbers of compounds in these two 
subsets. The structural descriptor that gives the largest t-value is chosen as the best 
descriptor for the split, if its corresponding Bonferroni-adjusted p-value is smaller than some 
pre-set termination criterion (default 0.01). The Bonferroni adjustment multiplies the raw 
Student's t-test p-value by the number of variables under consideration, taking into account 
the number of statistical tests in order to avoid the increased probability of a false positive 
split. 
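
The statistic and the adjustment described above can be sketched in a few lines. This is a minimal illustration using only the standard library: `pooled_t` follows the pooled two-sample form of the t-test given in the text, and `bonferroni` multiplies a raw p-value by the number of tests. The function names are hypothetical, capping the adjusted value at 1 is a conventional addition not stated in the text, and converting t to a raw p-value (which needs the t distribution) is deliberately left out.

```python
import math

def pooled_t(xs, ys):
    """Two-sample Student's t with pooled variance, M + N - 2 degrees of freedom."""
    m, n = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / m, sum(ys) / n
    ssx = sum((x - mean_x) ** 2 for x in xs)   # sum of squares, first subset
    ssy = sum((y - mean_y) ** 2 for y in ys)   # sum of squares, second subset
    pooled_variance = (ssx + ssy) / (m + n - 2)
    return (mean_x - mean_y) / math.sqrt((1.0 / m + 1.0 / n) * pooled_variance)

def bonferroni(p_raw, n_tests):
    """Bonferroni-adjusted p-value; the cap at 1.0 is an assumption of this sketch."""
    return min(1.0, p_raw * n_tests)

# Hypothetical activities of compounds with and without a descriptor.
t_value = pooled_t([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
```

A split would then be accepted only if the adjusted p-value for the best descriptor falls below the termination criterion (default 0.01 in the text).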

Since multiple conformations have been generated for each compound, the method 
assigns the absence of a particular descriptor to a compound if none of its conformations 
contains that descriptor; otherwise, if any of its conformations contains that particular 
descriptor, the method assigns the presence of it to that compound. This is akin to a Boolean 
OR operation on all the conformation "binary" descriptor strings of a compound, forming 
the compound "binary" descriptor string, which takes the form of a string of 0/1 bits. 
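
The OR operation over conformations can be shown with a short sketch. It assumes each conformation's descriptor string is stored as a Python integer whose bits mark descriptor presence; that storage choice and the name `compound_bits` are hypothetical.

```python
from functools import reduce
from operator import or_

def compound_bits(conformer_bits):
    """OR together the per-conformation descriptor bit strings (stored as ints)."""
    return reduce(or_, conformer_bits, 0)

# Three hypothetical conformations of one compound, four descriptors each.
bits = compound_bits([0b0101, 0b0011, 0b0001])
```

A descriptor bit is set for the compound as soon as any one conformation sets it, and a compound with no conformations yields an all-zero string.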

The pharmacophore model construction procedure of the present invention proceeds 
in the following fashion: a two-point pharmacophore is searched for first, and then a new 
pharmacophore point is searched for and added if it is found. This process continues adding a 
single point at a time until no more statistically significant pharmacophore points can be 
found. Correspondingly, two-point structural descriptors are generated for each compound 
at first, which are composed of two chemical features and the "binned" distance between 
them. After the most significant two-point descriptor has been determined, new three-point 
structural descriptors are generated for each compound. These three-point structural 
descriptors retain the features and distance in the most significant two-point descriptor that 
has been found, and therefore are actually composed of only the third chemical feature and 
its two "binned" distances to each of the former two chemical features. Following the 
same rule, the newer descriptors are composed of a newer chemical feature and its "binned" 
distances to all of the former chemical features that have been determined. This process is 
iterative and stops when no new pharmacophore features can be found. 
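
A two-point descriptor of the kind described above can be sketched as follows. The sketch assumes each conformation is summarized as a list of (feature type, coordinates) pairs and that distances are binned by integer division with a fixed width. The bin width of 1.0 (in the same length unit as the coordinates), the feature labels, and the name `two_point_descriptors` are all hypothetical; the text does not specify the binning scheme.

```python
import math
from itertools import combinations

def two_point_descriptors(features, bin_width=1.0):
    """Build the set of (type, binned distance, type) descriptors for one conformation.

    features: list of (feature_type, (x, y, z)) tuples.
    """
    descriptors = set()
    for (type_a, pos_a), (type_b, pos_b) in combinations(features, 2):
        distance_bin = int(math.dist(pos_a, pos_b) // bin_width)
        # Sort the two types so the descriptor does not depend on pair order.
        first, second = sorted((type_a, type_b))
        descriptors.add((first, distance_bin, second))
    return descriptors

# Hypothetical feature points of a single conformation.
example = two_point_descriptors([
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.2, 0.0, 0.0)),
    ("aromatic", (0.0, 4.7, 0.0)),
])
```

A three-point descriptor would follow the same pattern, keeping the selected two-point descriptor fixed and recording only the new feature type with its two binned distances to the existing points.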

There are two kinds of possible splits at each split point: positive and negative splits. 
A positive split is one in which the sub-node that contains the split descriptor is the more 
active node, i.e., the subset of compounds containing the split descriptor is more active on 
average than the subset of compounds not containing that split descriptor; otherwise it is a 
negative split. The software implementation of the present invention uses the positive split 
as the default split method. Although information on the excluded volume could be 
determined by enabling selection of negative splits, the software implementation has the 
advantage of simplicity, as the negative splits are often difficult to select using classic 
pharmacophore modeling. The default is positive splits only, but the user can ask for 
both positive and negative splits. 

(3) Combining conformational search and correspondence search. 
Previous methods did not combine conformational search and correspondence search. 
The present invention combines conformation generation and 
correspondence search. The sampling completeness of the descriptors is used as 
the criterion to terminate each round of conformational search, because once a 
particular descriptor space has been completely sampled, further 
conformational searches will not add more information for the following statistical test. 
In turn, the pharmacophoric descriptors that have been identified are imposed as additional 
constraints in the next-round conformational search, so that the sampling in those important 
conformational subspaces can be more thorough. 



The general design of the software implementation of the present invention, 
SCAMPI, wherein the conformational and correspondence searches are combined, is 
illustrated as the flow chart in Figure 4 and described in the following steps. 

Step 1: The first-round conformational search is done for each compound without 
any constraints except the chemical structure itself. The search completeness is judged by 
monitoring the "extended" two-point descriptors, which are defined as "(chemical feature 
ID)-(binned distance)-(chemical feature ID)". When no new "extended" two-point 
descriptor has been found for n consecutive conformations (default 100), it is assumed that this descriptor 
space has been completely sampled. All the "extended" two-point descriptors are then 
converted to the "standard" two-point descriptors described above, which have the form 
"(chemical feature type)-(binned distance)-(chemical feature type)", for the following 
recursive partitioning analysis. During the conformational search process, all the generated 
conformations are saved in memory for further use. 
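
The completeness criterion of Step 1 can be sketched as a simple loop. The sketch assumes a caller-supplied function that generates one new conformation per call and returns its "extended" descriptors; `generate_descriptors`, `patience`, and the `max_rounds` safety cap are hypothetical names and additions, with `patience` playing the role of the default n = 100.

```python
def sample_until_complete(generate_descriptors, patience=100, max_rounds=100_000):
    """Sample conformations until no unseen descriptor has appeared for
    `patience` consecutive conformations (or `max_rounds` is reached).

    generate_descriptors: callable returning an iterable of hashable
    descriptors for one newly generated conformation.
    Returns the set of descriptors seen and the number of conformations used.
    """
    seen = set()
    quiet = 0      # consecutive conformations that added nothing new
    rounds = 0
    while quiet < patience and rounds < max_rounds:
        new = set(generate_descriptors()) - seen
        if new:
            seen |= new
            quiet = 0
        else:
            quiet += 1
        rounds += 1
    return seen, rounds
```

Because the counter resets whenever a new descriptor appears, flexible compounds that keep yielding new descriptors are automatically sampled for longer than rigid ones.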

Step 2: Student's t-test is used to analyze the two-point descriptor space, to find the 
most significant two-point descriptor to split the compounds into the more active and less active 
groups. If such a descriptor is found, the two chemical features and the distance between 
them in this descriptor will be treated as the first two pharmacophoric points, and the whole 
data set is divided into two subsets: the compounds containing this two-point descriptor and 
the compounds not containing it. 

Step 3: The Ullmann algorithm is used to search all the saved conformations of the 
compounds in the positive subset, using the most significant two-point descriptor as the 
search query. The pairs of chemical feature points that satisfy the search 
query are identified and recorded for the following constrained conformational search. 

Step 4: The second-round conformational search is done for all the compounds in the 
positive subset. The "binned" distance in the most significant two-point descriptor is added 
to the pair of chemical feature points, which have been identified previously by the Ullmann 
algorithm, as an additional distance constraint. Then, as in steps 1 and 2, three-point 
descriptors are generated during the constrained conformational search, and the search 
completeness is judged by monitoring the "extended" three-point descriptors. Student's t-test is 
used again to analyze the "standard" three-point descriptors in order to find the most 
significant third pharmacophoric point. 

Step 5: For the compounds in the negative subset, Student's t-test is used again to 
analyze the former "standard" two-point descriptors in order to find the next most significant 
two-point descriptor. 

Step 6: The above steps are repeated as shown in Figure 4, until no more significant 
pharmacophoric points can be found or the default maximal number of pharmacophoric 
points (presently, set at 5 points) has been attained. 

As a further refinement on the implementation of the invention, sparse matrix techniques are 
utilized throughout in order to conserve memory. Lists of structures where descriptors are 
found, instead of lists of descriptors that are found in those structures, are stored for the 
recursive partitioning analysis. Hash-table search is used to insert all the descriptors into 
the SAR table. Furthermore, dynamic memory management is widely applied to optimize 
memory utilization and also to increase the computational speed, since this avoids saving a 
large amount of conformational information on hard disk. 
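
The storage scheme described above, lists of structures per descriptor held in a hash table, can be sketched with a dictionary of sets. This is only an illustration of the idea; the names `build_sar_table` and `split_on` and the use of integer compound IDs are hypothetical.

```python
from collections import defaultdict

def build_sar_table(compound_descriptors):
    """Map each descriptor to the set of compound IDs in which it was found.

    compound_descriptors: dict of compound ID -> iterable of descriptors.
    """
    table = defaultdict(set)
    for compound_id, descriptors in compound_descriptors.items():
        for descriptor in descriptors:
            table[descriptor].add(compound_id)   # hash-table insert
    return table

def split_on(table, descriptor, node_ids):
    """Split the compounds of a node into those with and without `descriptor`."""
    present = table.get(descriptor, set()) & node_ids
    return present, node_ids - present

# Hypothetical three-compound data set.
table = build_sar_table({1: {"d1", "d2"}, 2: {"d2"}, 3: set()})
with_d2, without_d2 = split_on(table, "d2", {1, 2, 3})
```

Storing compound lists per descriptor means that each candidate split is a set intersection, and descriptors that never occur take no memory at all, which is the sparse-matrix benefit referred to in the text.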

Examples: 

(1) Monoamine oxidase (MAO) inhibitors. 

Of the 1,650 compounds in the original MAO data set provided by Abbott 
Laboratories, CONCORD successfully converted 1,644 compounds from the 2D structures 
to the 3D structures. The structures and activities of these 1,644 compounds were then used 
as the input for SCAMPI. 

With a single run of SCAMPI, a recursive partition tree was generated as shown in 
Figure 5. The computational time was about 1.2 CPU hours and a total of 400,033 
conformations were generated in the entire process. In Figure 5, there are two major 
active nodes, shown shaded. The two corresponding pharmacophores are illustrated in 
Figure 6. The first one contains an aromatic ring center, a triple bond center and a positive 
charge center on nitrogen. The second one contains two hydrogen bond donors on nitrogen, 
with a perfectly correlated carbonyl group at the adjacent position; this group of atoms 
forms a hydrazide feature. 

These two pharmacophore models are supported by previous experimental work. 
Hydrazide MAO inhibitors (e.g., compound AL16432 in Figure 6) can be hydrolyzed to 
acetylhydrazines that act as non-selective, irreversible inhibitors by covalently binding to 
various macromolecules including MAO. Propargylamines (e.g., compound AL19120 in 
Figure 6) are themselves suicide inhibitors that irreversibly inhibit MAO through covalent 
attachment to its flavin cofactor. Simultaneously finding the features governing these two 
mechanisms is a clear demonstration that SCAMPI has the capability to detect multiple 
mechanisms of action co-existing in a large chemical data set. 

(2) Angiotensin-converting enzyme (ACE) inhibitors. 

The ACE data set is composed of 114 ACE inhibitors provided by Tripos Inc. and 
932 compounds randomly picked from the WDI (World Drug Index) database to act as 
negative compounds. The biological activities of the 114 ACE inhibitors are expressed as 
continuous pIC50 values and the WDI compounds are arbitrarily assigned a pIC50 of 0. The 
structures and activities of these 1,046 compounds were then used as the input for SCAMPI. 

With a single run of SCAMPI, a recursive partition tree was generated as shown in 
Figure 7. The computational time was about 8.1 CPU hours and a total of 573,798 
conformations were generated in the entire process. In Figure 7, there are two major 
active nodes, shown shaded. The two corresponding pharmacophores are illustrated in 
Figure 8. The first pharmacophore contains a negative charge center located on a carboxylate 
group, an oxygen atom on a carbonyl group, another negative charge center on a carboxylate 
group, and a nitrogen atom. The second pharmacophore contains a negative charge center 
located on a carboxylate group, an oxygen atom on a carbonyl group, and a sulfur atom in 
a thiolate group. 

A comparison of the two pharmacophores in Figure 8 shows the similarity between 
the first three points of the two pharmacophores. They share the same geometry and two 
of the three pharmacophore feature types. A literature search indicates that they do follow the 
same binding mode, in which the carboxylate and the thiolate bind to the same zinc atom in ACE. The 
commonly accepted pharmacophore is composed of a negative charge center, an oxygen 
acting as a hydrogen bond acceptor, and a zinc binding site. Because we did not define a special 
chemical feature for the zinc binding site, SCAMPI split the compounds following this binding 
mode into two different terminal nodes. As to the fourth point in the first 
pharmacophore, a nitrogen atom, the statistical test indicated that it contributes significantly 
to the binding. This example demonstrates again that SCAMPI can quickly find pharmacophores 
consistent with known results. 
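
The statistical test used to rank candidate descriptors can be sketched as follows. This is a 
minimal illustration and not the SCAMPI implementation: it assumes a Welch-style two-sample 
t statistic that compares the activities of compounds containing a descriptor with those of 
compounds lacking it. The function names (welch_t, best_split), the descriptor labels, and the 
data layout are hypothetical. 

```python
import math
from statistics import mean, variance


def welch_t(present, absent):
    """Welch's t statistic for the difference in mean activity of two groups.

    Returns 0.0 when either group is too small or has no spread,
    so that such a split is never selected.
    """
    n1, n2 = len(present), len(absent)
    if n1 < 2 or n2 < 2:
        return 0.0
    se = math.sqrt(variance(present) / n1 + variance(absent) / n2)
    if se == 0.0:
        return 0.0
    return (mean(present) - mean(absent)) / se


def best_split(activities, descriptor_hits):
    """Pick the descriptor whose presence/absence split has the largest |t|.

    activities      -- dict: compound name -> activity (e.g. pIC50)
    descriptor_hits -- dict: descriptor label -> set of compounds containing it
    (both layouts are assumptions made for this sketch)
    """
    best_desc, best_t = None, 0.0
    for desc, hits in descriptor_hits.items():
        present = [a for c, a in activities.items() if c in hits]
        absent = [a for c, a in activities.items() if c not in hits]
        t = welch_t(present, absent)
        if abs(t) > abs(best_t):
            best_desc, best_t = desc, t
    return best_desc, best_t
```

Applied recursively to the "present" and "absent" subsets, a selection rule of this kind yields 
a partition tree in which each split corresponds to one descriptor. 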

In the preferred embodiment of the present invention, the system and method are 
implemented on a computer 900 as illustrated in Figure 9. The computer comprises a central 
processing unit (CPU) 902 for performing the calculations of the described methods, a 
storage device 908 for storing data and files that can be retrieved by the processor, an input 
device 904 enabling user interaction with the computer, a display device 906, and dynamic 
memory 919 for storing one or more programs during execution, such as the program that 
performs the above method. Alternatively, the system and method could be implemented 
across a network of computers, enabling the program to be run by multiple processors at 
separate physical locations. 

The above description is included to illustrate the operation of the preferred 
embodiments and is not meant to limit the scope of the invention. From the above 
discussion, many variations will be apparent to one skilled in the art that would yet be 
encompassed by the spirit and scope of the present invention. 
I claim: 

1. A conformational search method, wherein the steps of performing a conformational 
search comprise the steps of: 

isolating each flexible ring in a compound by cutting off one or more side chains 
while keeping the side chain neighbors nearest the flexible ring; 

perturbing every flexible corner of the isolated ring relative to an average ring plane; 
reconnecting the chains cut off in the first step; 

calculating torsional ranges for each rotatable bond using the differential distance 
equation; 

responsive to the identification of points within the calculated torsional ranges, 
uniformly sampling the identified points; and 
storing the sampled conformations. 
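
The uniform sampling step of claim 1 can be illustrated with a short sketch. It assumes that 
the allowed torsional ranges for each rotatable bond have already been calculated (the 
differential distance equation is not reproduced here) and that a fixed angular step is used. 
The function names, the 30-degree default step, and the data layout are hypothetical. 

```python
from itertools import product


def sample_range(lo, hi, step):
    """Evenly spaced torsion angles (degrees) from lo up to at most hi."""
    if hi < lo:
        raise ValueError("range upper bound is below lower bound")
    count = int((hi - lo) // step)
    return [lo + i * step for i in range(count + 1)]


def enumerate_torsions(ranges, step=30.0):
    """Combine the sampled angles of every rotatable bond.

    ranges -- one entry per rotatable bond; each entry is a list of
              allowed (lo, hi) intervals in degrees (assumed precomputed).
    Returns a list of tuples, one torsion angle per bond.
    """
    per_bond = []
    for intervals in ranges:
        angles = []
        for lo, hi in intervals:
            angles.extend(sample_range(lo, hi, step))
        per_bond.append(angles)
    return list(product(*per_bond))
```

Each returned tuple defines one candidate conformation to be built and stored. 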

2. The method of claim 1, further comprising the steps of, responsive to the calculated 
torsional range failing to meet the van der Waals distance constraints: 

regrouping atoms of the compound; 
calculating torsional ranges for each rotatable bond; and 
responsive to a rotatable bond of the regrouped atoms meeting the van der Waals 
distance constraints, storing the conformation. 

3. The method of claim 1, further comprising, prior to calculating the torsional ranges, the 
additional step of removing compounds that have atom pairs whose minimum accessible 
distances are less than their van der Waals distance constraints. 
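
The filter of claim 3 can be sketched as below. The sketch assumes that the van der Waals 
distance constraint for an atom pair is a scaled sum of the two atomic radii (Bondi radii are 
used here) and that the minimum accessible distance of each pair is supplied as input. The 
function names, the scale parameter, and the data layout are hypothetical. 

```python
# Bondi van der Waals radii in angstroms for a few common elements.
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def has_clash(min_distances, scale=1.0):
    """True if any atom pair cannot reach its van der Waals separation.

    min_distances -- list of (element_a, element_b, minimum accessible
                     distance in angstroms); assumed to be precomputed.
    scale         -- assumed tolerance factor applied to the radius sum.
    """
    for elem_a, elem_b, dist in min_distances:
        if dist < scale * (VDW_RADIUS[elem_a] + VDW_RADIUS[elem_b]):
            return True
    return False


def remove_clashing(compounds, scale=1.0):
    """Keep only the compounds with no violating atom pair."""
    return {name: pairs for name, pairs in compounds.items()
            if not has_clash(pairs, scale)}
```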

4. The method of claim 1, wherein the conformational search is implemented in the 
Cartesian coordinate system. 

5. The method of claim 1, wherein the conformational search is implemented in the internal 
coordinate system. 

6. The method of claim 1, wherein the step of perturbing further comprises the step of 
applying several steps of energy minimization to optimize the ring structure. 

7. The method of claim 6, wherein the step of reconnecting the side chains further comprises 
the step of applying several steps of energy minimization, wherein the optimization serves 
to optimize the structure of the compound and releases any bad van der Waals contacts 
between the side chains and rings. 

8. A method for automating the identification of pharmacophores from a large 
heterogeneous data set, the method comprising the steps of: 

performing a conformational search for each compound; 
storing all conformations generated during the conformational search; 

searching for a most significant two-point descriptor using a t-test to analyze the 
two-point descriptor space; 

responsive to finding the most significant two-point descriptor: 

storing two chemical features and distance between said features of the most 
significant two-point descriptor, and 

dividing all conformations into subsets according to the presence or absence 
of the stored two-point descriptor; 

searching all saved conformations of the compounds for which said stored two-point 
descriptor is present using the Ullman algorithm; 

responsive to the presence of one or more conformations that satisfy the search in 
the prior step, performing a second conformational search on said conformations wherein 
the stored distance for the most significant two-point descriptor is included as an additional 
distance constraint; 

responsive to one or more three-point descriptors generated during the second 
conformational search, searching said three-point descriptors for a third most significant 
pharmacophore point; and 

performing a third conformational search on conformations which did not contain 
the most significant two-point descriptor in order to locate a next most significant two-point 
descriptor. 
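
The two-point descriptor used in claim 8 (two chemical features plus the distance between 
them) and the division of conformations by its presence or absence can be illustrated as 
follows. The sketch assumes that distances are grouped into fixed-width bins and that each 
conformation is given as a list of typed feature coordinates. The function names, the 
1-angstrom bin width, and the feature labels are hypothetical. 

```python
import math
from itertools import combinations


def two_point_descriptors(features, bin_width=1.0):
    """All two-point descriptors found in one conformation.

    features -- list of (feature_type, (x, y, z)) in angstroms.
    Returns a set of (type_a, type_b, distance_bin) with the two
    feature types in sorted order, so the pair order does not matter.
    """
    found = set()
    for (type_a, pos_a), (type_b, pos_b) in combinations(features, 2):
        dist = math.dist(pos_a, pos_b)
        first, second = sorted((type_a, type_b))
        found.add((first, second, int(dist // bin_width)))
    return found


def split_by_descriptor(conformations, descriptor, bin_width=1.0):
    """Divide conformations by presence or absence of one descriptor.

    conformations -- dict: conformation name -> feature list (assumed layout).
    Returns (present_names, absent_names).
    """
    present, absent = [], []
    for name, features in conformations.items():
        if descriptor in two_point_descriptors(features, bin_width):
            present.append(name)
        else:
            absent.append(name)
    return present, absent
```

The "present" subset would then be searched for a third point, while the "absent" subset 
would be searched for the next most significant two-point descriptor. 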

9. The method of claim 8, wherein the method steps are repeated until no more 
significant pharmacophoric points are attained. 

10. The method of claim 8, wherein the method steps are repeated until a default number 
of significant pharmacophoric points has been attained. 

11. The method of claim 10, wherein the default number of significant points is five. 

12. The method of claim 8, wherein the step of storing includes storing conformations 
using sparse matrix techniques. 

13. The method of claim 8, wherein the step of storing includes storing data using 
dynamic memory management techniques. 

14. The method of claim 8, wherein the step of performing a conformational search 
comprises performing the conformational search of claim 1. 

15. The method of claim 8, wherein the step of performing a conformational search 
comprises performing the conformational search of claim 7. 

16. A system for automating the identification of pharmacophores from a large 
heterogeneous data set, the system comprising: 

means for performing a conformational search for each compound in the data set; 
coupled to the means for performing, means for storing all conformations generated 
during the conformational search; and 

coupled to the means for storing, means for searching for the most significant 
descriptors in the descriptor space. 

17. The system of claim 16, wherein the means for performing the conformational search 
includes: 

a computer processor unit (CPU); and 

a magnetic memory device having stored instructions which enable execution of the 
method of claim 1. 

18. The system of claim 16, wherein the means for searching for the most significant 
descriptors includes means for performing a t-test. 

19. A computer-readable medium containing a computer program for automating the 
identification of pharmacophores from a large heterogeneous data set, said program 
containing instructions for directing the computer to execute the steps of: 

performing a conformational search for each compound; 
storing all conformations generated during the conformational search; 

searching for a most significant two-point descriptor using a t-test to analyze the 
two-point descriptor space; 

responsive to finding the most significant two-point descriptor: 

storing two chemical features and distance between said features of the most 
significant two-point descriptor, and 

dividing all conformations into subsets according to the presence or absence 
of the stored two-point descriptor; 

searching all saved conformations of the compounds for which said stored two-point 
descriptor is present using the Ullman algorithm; 

responsive to the presence of one or more conformations that satisfy the search in 
the prior step, performing a second conformational search on said conformations wherein 
the stored distance for the most significant two-point descriptor is included as an additional 
distance constraint; 

responsive to one or more three-point descriptors generated during the second 
conformational search, searching said three-point descriptors for a third most significant 
pharmacophoric point; and 

performing a third conformational search on conformations which did not contain 
the most significant two-point descriptor in order to locate a next most significant two-point 
descriptor. 